Added socket id env for multi GPU configuration #16
base: main
Conversation
```yaml
volumeMounts:
  - name: device-plugin-socket
    mountPath: /var/lib/kubelet/device-plugins
resources:
```
Did I miss resources moving somewhere else, or was this just not used?
```diff
 containers:
   - name: nvshare-scheduler
-    image: docker.io/grgalex/nvshare:nvshare-scheduler-v0.1-f654c296
+    image: nvshare:nvshare-scheduler-a4301f8e
```
Is this local dev code?
```diff
 containers:
   - name: nvshare-lib
-    image: docker.io/grgalex/nvshare:libnvshare-v0.1-f654c296
+    image: nvshare:libnvshare-a4301f8e
```
Is this local dev code?
```diff
 mount = &pluginapi.Mount{
-	HostPath:      SocketHostPath,
 	ContainerPath: SocketContainerPath,
+	HostPath:      SocketHostPathNew,
```
I understand you want this to stand out especially on your test machine. Do we want this on the main branch?
```go
	serverSock = pluginapi.DevicePluginPath + "nvshare-device-plugin.sock"
)
```
Minor nit-pick: it might not hurt to run this through a code formatter.
```diff
 return &NvshareDevicePlugin{
 	devs:   getDevices(),
-	socket: serverSock,
+	socket: serverSockNew,
```
I see why you did this. Could the name be something more meaningful? The next person might add another "New" on the end.
I suggest `serverSockWithId` or `serverSockId`.
```diff
 if len(socketId) == 0 {
 	socketId = "0"
 }
+resourceNameNew := resourceName + socketId
```
resourceNameWithId?
```yaml
imagePullPolicy: IfNotPresent
env:
  - name: NVSHARE_VIRTUAL_DEVICES
    value: "10"
```
Will this work for a 12-GPU machine? It looks like existing code that moved. Even if this were changed to 12, would it add value on a 12-GPU machine? Anyway, I entered an issue, since this looks like a separate issue from the multi-GPU support.
From reviewing the code, not live experience: hard-coded to 10 virtual GPUs?
Disclaimer: I'm familiar with C# and JavaScript and have some ancient experience with Java and C/C++; this is my first time seeing Go. I am too excited to see this in production to leave it untouched.
I love this change and look forward to using it! If someone has already tested this on a single-GPU machine, then I vote for an approval.
Minor nit-picks:
- `variableNameNew` could be more descriptive, like `variableNameWithId` or `variableNameId`
- Overall, the code is well formatted and easy to read, but there are a few things an automatic code formatter might improve.
Today I learned that by default anyone can do a code review on a PR made to a public repo. I'll take a look at this when I get a chance.

However, just dropping a PR is not the correct way of doing things. Some open-source repos do that, but usually they just end up piling shit on top of shit. Each PR must have a backing issue that it addresses, and each commit should either ref ( See issue #11 and the associated PR for a real example of the flow I'm describing.

Additionally, no commit should break the software, i.e., anyone should be able to build from source or deploy from the manifests for any given commit. In this case:

So, @AMIYAMAITY can you open an issue, describing the problem you want to address, along with a suggested plan to address it? @jake-brewer-isa Thanks for attempting to review, but it's too early for that.
Hi @grgalex
It's been a week and more. We are using the nvshare library in Kubernetes pods and managing GPU memory across processes/pods. It's working fine without any issues. Thank you for making it open for everyone; it's quite useful for GPU-based projects.

Now we are trying to integrate it with multiple GPUs. In a node we have 2 GPUs, and we are trying to integrate both of them.

Regarding multi-GPU, I have experimented with a couple of things. On the repo I went through some issues about multi-GPU configuration. Initially I increased the scheduler and device plugins in the deployment files and deployed the nvshare-system. Then we deployed our project on the node. As soon as we deployed, we got a socket connection error.

After that I made a fork and added an environment variable `NVSHARE_SOCK_ID`. It basically takes the id of the GPU, and based on that id it creates/connects to the socket. We have passed `NVSHARE_SOCK_ID` and `NVIDIA_VISIBLE_DEVICES` in a couple of places, like the `scheduler`, `device-plugins`, and the containers. This approach is working fine without any issues. In multiple containers we can pass `NVSHARE_SOCK_ID` and `NVIDIA_VISIBLE_DEVICES` based on which GPU we want to use. Below are some screenshots of the config and deployments:
Now we have another use case. Let's assume that in a container we are passing `NVSHARE_SOCK_ID: "0"` and `NVIDIA_VISIBLE_DEVICES: "0"`; it means this container will access only the 0th GPU. In one of the containers there is a single process with two threads. Each thread utilizes a GPU via `cudaSetDevice()` in C++. In normal cases we have to pass `NVIDIA_VISIBLE_DEVICES: "all"` and then we can access and map them. But along with nvshare, how will we pass and manage the socket for `NVSHARE_SOCK_ID`? Could you please go through my changes and help with the above problem?